It is shown that, for kernel-based classification with univariate 
distributions and two populations, optimal bandwidth choice has a 
dichotomous character. If the two densities cross at just one point, 
where their curvatures have the same signs, then minimum Bayes risk 
is achieved using bandwidths which are an order of magnitude larger 
than those which minimize pointwise estimation error. On the other 
hand, if the curvature signs are different, or if there are multiple 
crossing points, then bandwidths of conventional size are generally 
appropriate. The range of different modes of behavior is narrower in 
multivariate settings. There, the optimal size of bandwidth is gen- 
erally the same as that which is appropriate for pointwise density 
estimation. These properties motivate empirical rules for bandwidth 
choice. 

1. Introduction. 

1.1. Motivation and main results. A common approach to nonparamet- 
ric classification based on data from training samples is to construct non- 
parametric estimators of population densities and substitute them for the 
true densities in a theoretically optimal algorithm for minimizing Bayes risk. 
Not only is this approach intuitively appealing and operationally straight- 
forward, it is optimal in a minimax sense, as argued by Marron (1983). 
However, it is unclear how one might select a bandwidth that minimizes 
risk. In particular, we might ask from a theoretical viewpoint what relation- 
ship exists between the sizes of bandwidth that are appropriate for pointwise 
density estimation and for optimal classification. And even if we understand 
this connection, and have a theoretically optimal formula for bandwidth, 
how might we go about constructing empirical approximations to it? 
In this note we briefly summarize how bandwidth choice influences classification
error, and suggest ways of choosing bandwidth to minimize that 
error. In particular, we show that when only two populations are involved, 
when the populations are univariate, and when the densities intersect at a 
single point, the following dichotomous result arises. If the density curva- 
tures are of different signs at the crossing point, then minimum Bayes risk 
is achieved using bandwidths that are of the same sizes as those which min- 
imize pointwise estimation error. On the other hand, if the curvatures are of 
the same sign, then quite different bandwidth sizes, in fact, similar to those 
that would be employed if the kernel was of fourth (rather than second) or- 
der, are appropriate. Furthermore, if there is more than one crossing point, 
then, generally speaking, the first of these two sizes of bandwidth applies. 

Ironically, the problem actually becomes simpler in more complex set- 
tings, where the classification problem involves multivariate data. There, it 
is generally the case that the optimal size of bandwidth (in the sense of 
minimizing Bayes risk) is the same as that which would be used if we were 
constructing pointwise density estimators. 

The problem of empirical bandwidth choice suffers from unexpected dif- 
ficulties. It might reasonably be thought that leave-one-out methods, which 
have been so successful in related problems of nonparametric inference [see, 
e.g., Hall (1983), Stone (1984), Hardle and Kelly (1987) and Gyorfi, Kohler, 
Krzyzak and Walk (2002)], would perform well in this setting. For example, 
one could compute the estimate of classification error when a given datum X 
was omitted from the sample, evaluate the estimate at X , and then average 
over all values of X in order to obtain an estimate of classification error that 
could be minimized with respect to bandwidth. However, we shall show that
this generally gives poor performance. The reason is that classification error
depends on properties of density estimators at the relatively small number of
places where the true densities cross, and the leave-one-out approach described
above does not give consistent estimates of error at individual points such as x;
it is necessary to average over a continuum of points in the neighborhood of x.
The extra degree of smoothing required by this step complicates inference, 
with the result that alternative approaches are relatively attractive. 

1.2. Relationship to literature. The extensive literature on this topic in- 
cludes results which, at first sight, might appear to be contradictory. For 
example, it is known that, while there exists a class of universally consistent 
classifiers [see, e.g., Lugosi and Nobel (1996)], the convergence rate of any 
classifier can be arbitrarily slow [Devroye, Gyorfi and Lugosi (1996), Chap- 
ter 7 and Yang (1999a)]. Indeed, arbitrarily slow rates can apply even for 
smooth densities [Devroye (1982)]. Moreover, while for large classes of densi- 
ties (e.g., monotone ones) the rate of convergence of the risk for classification 
is strictly faster than that for estimation, the two problems are, in fact, of 
the same difficulty in a well-defined sense [Yang (1999a)]. Also, although
the risk of members of a popular class of classifiers converges to its asymptotic
limit at rate $n^{-2}$, where n denotes sample size [Cover (1968)], that for
classifiers based on empirical forms of Bayes risk converges no more quickly
than $n^{-1}$, even in parametric settings [e.g., Kharin and Ducinskas (1979)]. If
Bayes risk-based classifiers use kernel estimators, or related nonparametric
methods based on places where densities cross, then they converge at slower
rates than $n^{-1}$, which are nevertheless minimax-optimal [e.g., Marron (1983)
and Mammen and Tsybakov (1999)].

Such contrasts, particularly those between results of Lugosi and Nobel 
(1996) and Devroye, Gyorfi and Lugosi (1996), or among the convergence- 
rate results noted by Yang (1999a), are particularly engaging, but, of course, 
do not amount to contradictions. Differences among minimax results can be 
accommodated by noting that the classes over which the "max" part of 
"minimax" is taken are not identical. There is no real conflict between the 
results of Cover (1968) and those for Bayes risk-based methods, since the 
limiting risk of the nearest-neighbor methods treated by Cover is (except 
in degenerate cases) strictly greater than the Bayes risk, and so the fast 
convergence rate does not imply good performance. 

Work in the present paper relates to kernel-based methods for classifi- 
cation, which date from contributions of Fix and Hodges (1951). It is less 
closely connected to classification problems involving very high-dimensional 
data; for the latter setting, see, for example, Breiman (1998, 2001), Schapire, 
Freund, Bartlett and Lee (1998), Friedman, Hastie and Tibshirani (2000), 
Kim and Loh (2001), Dudoit, Fridlyand and Speed (2002) and Jiang (2002). 
Although there is some evidence that multiplicative bias/variance decom- 
positions play an important role in such contexts, considerable interest still 
resides in additive decompositions of the type addressed in the results we 
shall discuss. For example, in a wide-ranging contribution to classification 
problems for multivariate (and, in particular, high-dimensional) data, Fried- 
man [(1997), Section 11] draws particular attention to the role of additive 
decompositions in classification problems. 

In addition to the work discussed above, there is an extensive literature 
on nonparametric methods for classification, much of it based on using an 
empirical version of the Bayes-optimal rule. Fukunaga and Hummels (1987) 
and Psaltis, Snapp and Venkatesh (1994) extend Cover's (1968) work to 
d dimensions, where the classification error of nearest-neighbor methods
converges at rate $n^{-2/d}$. Efron (1983) and Efron and Tibshirani (1997) discuss
the performance of bootstrap-based estimators of error rate for general
classification methods. Chanda and Ruymgaart (1989) address kernel-based 
classification rules when the two distributions differ only in location, and 
where tails decrease exponentially fast or in a regularly varying manner. See 
also Kharin (1983), who gives related results in multivariate settings, and 
Devroye, Gyorfi and Lugosi [(1996), Theorem 6.6], who provide an elegant 
upper bound. Krzyzak (1991) derives bounds on Bayes probability of er- 
ror for kernel-based classification rules; Lapko (1993) gives a book-length 
account, in Russian, of nonparametric classification, including techniques 
based on nonparametric density estimation; Pawlak (1993) proposes kernel- 
based classification rules for use with incomplete data; Lugosi and Pawlak 
(1994) describe properties of a posterior-probability estimator of classifica- 
tion error for nonparametric classifiers; Ancukiewicz (1998) introduces class- 
based classification rules founded on nonparametric density estimators; Yang 
(1999b) studies nonparametric estimation of conditional probability for clas- 
sification; Baek and Sung (2000) introduce a nearest-neighbour search al- 
gorithm for nonparametric classification; Steele and Patterson (2000) give 
formulae for exact calculation of bootstrap estimates of expected prediction 
error for nearest-neighbor classifiers; and Lin (2001) suggests a nonparamet- 
ric classification rule for univariate data, based on the minimum Kolmogorov 
distance between two populations. 

1.3. Summary. Section 2 presents our main results in the univariate, 
two-population case, where at least one of the densities is not close to zero. 
Section 3 suggests ways of removing the latter constraint; Section 4 treats 
empirical choice of bandwidth; Section 5 addresses generalizations to mul- 
tiple and multivariate populations; and Section 6 outlines numerical prop- 
erties. For the sake of brevity, most proofs are omitted, being available in a 
longer version of the paper, available online [Hall and Kang (2002)]. How- 
ever, a brief account of the reasons for failure of leave-one-out methods is 
given in Section 7. 

2. Classifying data from the body of a distribution: two-population case. 

2.1. Kernel-based classifiers. Let the two populations have distributions
F and G, with respective densities f and g. Let $0 < p < 1$ reflect the prior
probability that a new, unclassified datum, x say, lying in a given interval
$\mathcal{I}$, is drawn from F. (To avoid degeneracy we assume throughout that
$0 < p < 1$.) Denote by $A_0$ the "ideal" algorithm that classifies x as coming
from F or G according as $\Delta(x) = p f(x) - (1-p) g(x)$ is positive or negative,
respectively. [We may make the classification arbitrarily if $\Delta(x)$ vanishes.]
Among all measurable algorithms A for classification on $\mathcal{I}$, $A_0$ is optimal in
the sense of minimizing the Bayes risk

$$\mathrm{err}_A(f,g|\mathcal{I}) = p \int_{\mathcal{I}} P(x \text{ is classified by } A \text{ as coming from } g)\, f(x)\,dx + (1-p) \int_{\mathcal{I}} P(x \text{ is classified by } A \text{ as coming from } f)\, g(x)\,dx.$$

Optimality requires that prior probabilities for F and G, restricted to $\mathcal{I}$, be
precisely p and 1 - p, respectively, although this assumption will not be a
prerequisite for our main theoretical results.
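
As a numerical illustration of the Bayes risk of the ideal rule, the sketch below uses the fact that the ideal rule errs with density $\min\{p f, (1-p) g\}$ and integrates that quantity on a grid. The two normal densities, the prior, the interval and all function names are assumptions chosen for illustration; none of them come from the paper.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Density of the normal distribution with mean mu and standard deviation sigma."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def bayes_risk_ideal(f, g, p, lo, hi, num=20001):
    """Risk of the ideal rule on [lo, hi]: the integral of min{p f, (1 - p) g}."""
    x = np.linspace(lo, hi, num)
    integrand = np.minimum(p * f(x), (1.0 - p) * g(x))
    dx = x[1] - x[0]
    # Trapezoidal rule, written out explicitly.
    return float(dx * (integrand.sum() - 0.5 * (integrand[0] + integrand[-1])))

# Hypothetical populations: F = N(0, 1), G = N(2, 1), equal priors.
f = lambda x: normal_pdf(x, 0.0, 1.0)
g = lambda x: normal_pdf(x, 2.0, 1.0)
risk = bayes_risk_ideal(f, g, p=0.5, lo=-8.0, hi=10.0)
```

For this particular pair the two weighted densities cross once, at x = 1, so the computed value should be close to the standard normal probability of exceeding 1.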

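Given training datasets $\mathcal{X} = \{X_1, \ldots, X_m\}$ and $\mathcal{Y} = \{Y_1, \ldots, Y_n\}$ drawn
from F and G, respectively, an empirical version of $A_0$ may be based on
nonparametric density estimators, $\hat f$ and $\hat g$ say, computed from $\mathcal{X}$ and $\mathcal{Y}$.
Specifically, given a nonnegative kernel K and bandwidths $h_1, h_2 > 0$, let

$$\hat f(x) = \frac{1}{m h_1} \sum_{i=1}^{m} K\Bigl(\frac{x - X_i}{h_1}\Bigr), \qquad \hat g(x) = \frac{1}{n h_2} \sum_{i=1}^{n} K\Bigl(\frac{x - Y_i}{h_2}\Bigr), \tag{2.2}$$

and let $A_1$ be the rule that classifies x as coming from F or G, according as
$\hat\Delta(x) = p \hat f(x) - (1-p) \hat g(x)$ is positive or negative, respectively.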
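
A minimal sketch of the empirical rule may help fix ideas: estimate each density with a kernel estimator and classify by the sign of the weighted difference. The Epanechnikov kernel, the simulated normal samples, the bandwidth values and the function names are all assumptions made for illustration, not choices taken from the paper.

```python
import numpy as np

def epanechnikov(u):
    """Bounded, symmetric, compactly supported probability density on [-1, 1]."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)

def kde(x, data, h, kernel=epanechnikov):
    """Kernel density estimate (len(data) * h)^{-1} * sum_i K((x - data_i) / h)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    data = np.asarray(data, dtype=float)
    u = (x[:, None] - data[None, :]) / h
    return kernel(u).sum(axis=1) / (data.size * h)

def classify_a1(x, xs, ys, p, h1, h2):
    """Return +1 (population F), -1 (population G) or 0 (tie) for each x."""
    delta_hat = p * kde(x, xs, h1) - (1.0 - p) * kde(x, ys, h2)
    return np.sign(delta_hat).astype(int)

rng = np.random.default_rng(0)
xs = rng.normal(0.0, 1.0, size=400)   # hypothetical training sample from F
ys = rng.normal(2.0, 1.0, size=400)   # hypothetical training sample from G
labels = classify_a1([-0.5, 2.5], xs, ys, p=0.5, h1=0.5, h2=0.5)
```

A return value of 0 marks the case where the estimated difference vanishes, which the text treats separately.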

Classification can be made arbitrarily if $\hat\Delta(x) = 0$. However, in this case
a distinction should be drawn between cases where at least one of $\hat f(x)$
and $\hat g(x)$ is nonzero and where $\hat f(x)$ and $\hat g(x)$ both vanish. In the latter
setting classification can be more prone to error. An alternative algorithm, 
not employing arbitrary choice, will be suggested in Section 3. 

2.2. Main results. We shall assume the following:

(2.3) $m/n$ is bounded away from zero and infinity as $n \to \infty$;

(2.4) f and g have two continuous derivatives and are bounded away from
zero in an open interval containing $\mathcal{I}$;

(2.5) $\Delta$ vanishes at just $\nu \ge 1$ points, $y_1, \ldots, y_\nu$, in $\mathcal{I}$, all of them interior points
and at each of which $\Delta'(y_j) \ne 0$;

(2.6) K is a bounded, symmetric and compactly supported probability density;

(2.7) for j = 1 and 2, $h_j = h_j(n) \asymp n^{-\rho}$ as $n \to \infty$, where $0 < \rho < 1$.

The notation $a(n) \asymp b(n)$ means that the ratio of left- and right-hand sides
is bounded away from zero and infinity as $n \to \infty$. The equivalence of bandwidth
sizes which (2.7) entails is not strictly necessary, but since optimal
bandwidths satisfy (2.7), it is imposed without loss of generality. Put
$h = n^{-\rho}$, where $\rho$ is as in (2.7).

Our proof of Theorem 2.1, stated below, needs only two (or four, in the
case of the second half of the theorem) continuous derivatives of f and g
in neighborhoods of a cross-over point, together with continuity of f and g
in an open interval $\mathcal{I}_{\mathrm{op}}$ containing $\mathcal{I}$, as asked by (2.4). However, (2.4) is
a standard condition when analyzing performance of second-order density 
estimators, and two bounded derivatives are required for the minimax results 
of Marron (1983). 
Theorem 2.1. Assume $0 < p < 1$ and $\mathcal{I}$ is a compact interval, and that
(2.3)-(2.7) hold. Then,

$$\mathrm{err}_{A_1}(f,g|\mathcal{I}) - \mathrm{err}_{A_0}(f,g|\mathcal{I}) = \frac{1}{2} \sum_{j=1}^{\nu} |\Delta'(y_j)|^{-1} E\{p \hat f(y_j) - (1-p) \hat g(y_j)\}^2 + o\{(nh)^{-1} + h^4\}. \tag{2.8}$$

If in addition $\nu = 1$, $f''(y_1) g''(y_1) > 0$,

$$\frac{h_2}{h_1} = \Bigl\{\frac{p f''(y_1)}{(1-p) g''(y_1)}\Bigr\}^{1/2} + o(h^2) \tag{2.9}$$

and f and g each have four continuous derivatives in a neighborhood of $y_1$,
then (2.8) continues to hold if the remainder there is replaced by $o\{(nh)^{-1} + h^8\}$.

Chanda and Ruymgaart (1989) give a version of (2.8) in cases where
g differs from f only in location, and tails are controlled by specific decay
assumptions. Result (2.8) is specific to kernel-based Bayes classifiers. Indeed,
as we noted in Section 1.2, Cover (1968) has shown that much faster rates
are possible for nearest-neighbor classifiers, for which the asymptotic risk
usually dominates the Bayes risk $\mathrm{err}_{A_0}(f,g|\mathcal{I})$.

An alternative algorithm is that suggested by Stoller (1954), and involves
classifying a new data value x as coming from f if $x < \arg\max(m \hat F - n \hat G)$,
where $\hat F$ and $\hat G$ are the empirical distribution functions computed from
$\mathcal{X}$ and $\mathcal{Y}$, respectively. Here the probability of misclassification, for data in $\mathcal{I}$,
converges to $\mathrm{err}_{A_0}(f,g|\mathcal{I})$, but only at rate $O_p(n^{-1/2})$.
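
The threshold in Stoller's rule can be sketched in a few lines, since m times the empirical distribution function of the first sample is just a count of points at or below t, and likewise for the second sample. The function name, the choice of candidate thresholds (the pooled data points) and the toy data are assumptions for illustration only.

```python
import numpy as np

def stoller_threshold(xs, ys):
    """Return a maximizer t of m * F_hat(t) - n * G_hat(t) over the pooled data points."""
    xs = np.sort(np.asarray(xs, dtype=float))
    ys = np.sort(np.asarray(ys, dtype=float))
    candidates = np.sort(np.concatenate([xs, ys]))
    # m * F_hat(t) and n * G_hat(t) are the numbers of sample points <= t.
    count_x = np.searchsorted(xs, candidates, side="right")
    count_y = np.searchsorted(ys, candidates, side="right")
    # np.argmax returns the first maximizer if there are several.
    return float(candidates[np.argmax(count_x - count_y)])

t_separated = stoller_threshold([0.0, 1.0, 2.0], [3.0, 4.0, 5.0])
t_mixed = stoller_threshold([0.0, 1.0, 4.0], [2.0, 3.0, 5.0])
```

A new value x would then be assigned to the first population when x lies below the returned threshold.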

2.3. Implications of Theorem 2.1. The expansion at (2.8) may be refined to

$$\mathrm{err}_{A_1}(f,g|\mathcal{I}) - \mathrm{err}_{A_0}(f,g|\mathcal{I}) = B_1 (nh)^{-1} + B_2 h^4 + o\{(nh)^{-1} + h^4\}, \tag{2.10}$$

where $B_1$ and $B_2$ are both functions of $H_1 = h_1/h$ and $H_2 = h_2/h$, and,
explicitly,

$$B_1 = \frac{1}{2} \kappa \sum_{j=1}^{\nu} |\Delta'(y_j)|^{-1} \{(r H_1)^{-1} p^2 f(y_j) + H_2^{-1} (1-p)^2 g(y_j)\},$$
$$B_2 = \frac{1}{8} \kappa_2^2 \sum_{j=1}^{\nu} |\Delta'(y_j)|^{-1} \{H_1^2 p f''(y_j) - H_2^2 (1-p) g''(y_j)\}^2, \tag{2.11}$$

with $\kappa = \int K^2$, $\kappa_j = \int u^j K(u)\,du$ and $r = m/n$. Result (2.10) implies that the
optimal bandwidth is of size $n^{-1/5}$ (i.e., $\rho = 1/5$), and that optimal values
of the constants $H_1$ and $H_2$ are obtained by minimizing $B_1 + B_2$, unless it
should be possible to render $B_2 = 0$ by some positive, nonzero choice of $H_1$
and $H_2$.
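
The bandwidth order can be checked with a small calculation. Treating the two constants as fixed numbers, which is a simplification made here for illustration, a variance-type term of order $(nh)^{-1}$ balanced against a bias-type term of order $h^k$ is minimized at a bandwidth proportional to $n^{-1/(k+1)}$. The function name and the numerical constants below are hypothetical.

```python
def balancing_bandwidth(var_const, bias_const, n, power):
    """Minimizer over h > 0 of var_const / (n * h) + bias_const * h ** power.

    Setting the derivative to zero gives h ** (power + 1) = var_const / (power * bias_const * n).
    """
    return (var_const / (power * bias_const * n)) ** (1.0 / (power + 1))

n = 10_000
h_second_order = balancing_bandwidth(1.0, 1.0, n, power=4)   # scales like n ** (-1/5)
h_eighth_power = balancing_bandwidth(1.0, 1.0, n, power=8)   # scales like n ** (-1/9)
```

With power 4 this reproduces the $n^{-1/5}$ order stated above; power 8 corresponds to the case in which the $h^4$ term can be removed.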

If $\nu = 1$, then $B_2 = 0$ is possible (for positive $H_1$ and $H_2$) if and only if
$f''(y_1)$ and $g''(y_1)$ are of the same sign; that is, the densities at the point $y_1$
where $pf$ and $(1-p)g$ cross are either both locally concave or both locally
convex. Assuming this to be the case, and choosing $h_1$ and $h_2$ as at (2.9), we
may show from (2.8) (with $h^8$ instead of $h^4$ in the remainder) that, instead
of (2.10),

$$\mathrm{err}_{A_1}(f,g|\mathcal{I}) - \mathrm{err}_{A_0}(f,g|\mathcal{I}) = B_3 (nh)^{-1} + B_4 h^8 + o\{(nh)^{-1} + h^8\}, \tag{2.12}$$

where, defining $R = p f''(y_1) / \{(1-p) g''(y_1)\}$, we have

$$B_3 = \frac{\kappa}{2 H_1} |\Delta'(y_1)|^{-1} \{r^{-1} p^2 f(y_1) + R^{-1/2} (1-p)^2 g(y_1)\},$$
$$B_4 = \frac{\kappa_4^2 H_1^8}{1152} |\Delta'(y_1)|^{-1} \{p f^{(4)}(y_1) - R^2 (1-p) g^{(4)}(y_1)\}^2. \tag{2.13}$$

Result (2.12) implies that the optimal bandwidth is now of size $n^{-1/9}$ (i.e.,
$\rho = 1/9$), and that the optimal constant $H_1$ is obtained by minimizing $B_3 + B_4$.

There is, of course, a possibility that the factor $T(f,g) = p f^{(4)}(y_1) -
R^2 (1-p) g^{(4)}(y_1)$ appearing in the definition of $B_4$ vanishes. In this case the
term in $B_4 h^8$ at (2.12) should be replaced by one in $h^{12}$, and the remainder
replaced by $o\{(nh)^{-1} + h^{12}\}$, provided f and g have continuous derivatives
of order 6 in a neighborhood of $y_1$. However, since $T(f,g)$ is a particularly
unusual functional of second and fourth derivatives of two distinct densities,
it is unlikely that in practice $T(f,g) = 0$.

In summary, excepting pathological cases that can be expected to arise
only rarely, the optimal bandwidths for classification when $\nu \ge 2$ are
$h_j^0 = H_j n^{-1/5}$, where $H_1, H_2 > 0$ are chosen to minimize

$$\sum_{j=1}^{\nu} |\Delta'(y_j)|^{-1} \Bigl[ \kappa \{(r H_1)^{-1} p^2 f(y_j) + H_2^{-1} (1-p)^2 g(y_j)\} + \tfrac{1}{4} \kappa_2^2 \{H_1^2 p f''(y_j) - H_2^2 (1-p) g''(y_j)\}^2 \Bigr]. \tag{2.14}$$

If $f''(y_1) g''(y_1) < 0$, then this prescription is also valid for $\nu = 1$. However,
if $\nu = 1$ and $f''(y_1) g''(y_1) > 0$, then, excepting pathological cases where
$T(f,g) = 0$, the optimal bandwidths are $h_1^0 = H_1 n^{-1/9}$ and $h_2^0 = H_2 n^{-1/9} =
H_1 R^{1/2} n^{-1/9}$, where $H_1 > 0$ minimizes

$$\frac{\kappa}{H_1} \{r^{-1} p^2 f(y_1) + R^{-1/2} (1-p)^2 g(y_1)\} + \frac{\kappa_4^2 H_1^8}{576} \{p f^{(4)}(y_1) - R^2 (1-p) g^{(4)}(y_1)\}^2. \tag{2.15}$$
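
The dichotomy summarized above can be written as a small decision function: with a single crossing point and curvatures of the same sign, the exponent is 1/9 and the bandwidth ratio is tied to the curvature ratio; otherwise the exponent is generally 1/5. This is a sketch that ignores the pathological cases mentioned in the text, and the function names and the normal example are assumptions for illustration.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of the normal distribution with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def normal_pdf_dd(x, mu, sigma):
    """Second derivative of the normal density."""
    z = (x - mu) / sigma
    return (z * z - 1.0) * normal_pdf(x, mu, sigma) / sigma ** 2

def bandwidth_regime(p, f_dd, g_dd, n_crossings=1):
    """Return (rho, ratio): bandwidths are of size n ** (-rho); ratio is h2 / h1 or None."""
    if n_crossings == 1 and f_dd * g_dd > 0:
        curvature_ratio = p * f_dd / ((1.0 - p) * g_dd)
        return 1.0 / 9.0, math.sqrt(curvature_ratio)
    return 1.0 / 5.0, None

# Hypothetical example: N(0, 1) and N(3, 1) with equal priors cross once, at 1.5.
y1 = 1.5
rho, ratio = bandwidth_regime(0.5, normal_pdf_dd(y1, 0.0, 1.0), normal_pdf_dd(y1, 3.0, 1.0))
```

In this symmetric example both densities are locally convex at the crossing point, so the larger-bandwidth regime applies and the two bandwidths are equal to first order.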
An extreme case is that where $\Delta$ is smooth and vanishes over a "plate,"
that is, a nondegenerate interval $J = [a, b]$. Then, each derivative of $\Delta$ which
exists must vanish on J. Therefore, if no discontinuities of derivatives enter
into the determination of properties of $\Delta$, the problem of estimating the
endpoints of J is essentially parametric. Provided there are no other points
where $\Delta$ vanishes, then it may be shown that under appropriate regularity
conditions, an empirical rule can get within $O(n^{-1})$ of $\mathrm{err}_{A_0}(f,g|\mathcal{I})$.

The setting where Bayes risk equals zero is sometimes addressed in the 
context of machine learning [see, e.g., Ehrenfeucht, Haussler, Kearns and 
Valiant (1989)]. Excluding the uninteresting degenerate case in which
$p(1-p) = 0$, and pathological cases where the support of f starts exactly at a
point where that of g ends (or vice versa), this setting entails $\Delta$ vanishing
on a plate, as discussed in the previous paragraph. Therefore, its main
implications are those that we have discussed previously. 

In many circumstances the discussion of classification given following
Theorem 2.1 applies in a general, global sense, to an empirical algorithm A
applied to any new datum $x \in \mathbb{R}$, rather than only to the algorithm $A_1$
restricted to $\mathcal{I}$. Details will be given in the next section.

3. Classification in the tails. 

3.1. Kernel-based classifiers. We shall assume that the supports of both
f and g are intervals, that neither density vanishes in the interior of its
support, and that a classification rule is sought in the upper tail. In this
instance our algorithm will be based on the assumption that, sufficiently
far to the right, the tail of f exceeds that of g, or vice versa. Formally,
we ask that either $f(x) > g(x)$ for all $x \in (x_0, x_{\mathrm{supp}})$, or $g(x) > f(x)$ for all
$x \in (x_0, x_{\mathrm{supp}})$, where $x_0$ is strictly less than the right-hand end, $x_{\mathrm{supp}}$, of
the support of f or g, respectively; and we seek a means of classifying new
data $x > x_0$. Of course, $x_{\mathrm{supp}}$ may be infinite.

If $x > x_0$ and $\hat f(x) = \hat g(x) = 0$, let $\hat x$ denote the infimum of values of $y < x$
such that $\hat f(z) = \hat g(z) = 0$ for all $z \in [y, x]$. Our algorithm, to which we refer
below as $A_R$, where the subscript indicates the right-hand tail, consists of
classifying x as coming from f or g, according as $\hat f(\hat x-) > 0$ or $\hat g(\hat x-) > 0$.
[With probability 1, exactly one of $\hat f(\hat x-)$ and $\hat g(\hat x-)$ will be nonzero.]
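
The right-tail rule can be sketched as follows, under one reading of the definition above: when both estimates vanish at x, move left to the first point where one of them becomes positive and classify according to which estimate that is. The sketch assumes a kernel supported on [-1, 1] and positive in the interior, so that an estimate is positive at z exactly when some data point lies within one bandwidth of z. The function name and toy data are hypothetical.

```python
import numpy as np

def classify_right_tail(x, xs, ys, h1, h2):
    """Tail rule sketch: +1 for F, -1 for G, 0 for an exact tie.

    Assumes a kernel supported on [-1, 1] and positive inside, so that
    f_hat(z) > 0 exactly when some X_i lies within h1 of z.
    """
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    if np.any(np.abs(x - xs) < h1) or np.any(np.abs(x - ys) < h2):
        raise ValueError("an estimate is nonzero at x; the tail rule does not apply")
    left_x, left_y = xs[xs < x], ys[ys < x]
    # Right-hand end of the region, to the left of x, where each estimate is positive.
    end_f = left_x.max() + h1 if left_x.size else -np.inf
    end_g = left_y.max() + h2 if left_y.size else -np.inf
    if end_f == end_g:
        return 0
    return 1 if end_f > end_g else -1

label = classify_right_tail(10.0, xs=[0.0, 1.0, 5.0], ys=[0.0, 2.0, 3.0], h1=0.5, h2=0.5)
```

Here the first sample reaches furthest to the right, so the new point at 10 is assigned to F.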

3.2. Main results. Theorem 3.1 below shows that the suboptimality level
discussed in Section 2, that is, $O(n^{-(1-\rho)})$ where $\rho = 1/5$ or $1/9$, is preserved
if the upper tail weights of f and g are sufficiently different. Theorem 3.2
demonstrates by example that if the tail weights are too close, then the level
of suboptimality can be of strictly larger order than $n^{-(1-\rho)}$.
Next we give regularity conditions for Theorem 3.1. Writing F and G for
the distributions corresponding to densities f and g, respectively, we ask
that:

(3.1) K is a bounded, symmetric, compactly supported and Hölder continuous
probability density;

(3.2) for j = 1 and 2, $h_j = h_j(n) \asymp n^{-\rho}$ as $n \to \infty$, where $0 < \rho < 1$;

(3.3) f and g are continuous, and strictly decreasing in their upper tails;

(3.4) for a constant $A_1 > 0$ and all sufficiently large x, $A_1 f(x) \ge f(x - x^{-1})$;

(3.5) for each $\epsilon > 0$ and all sufficiently large x, $\epsilon f(x) \ge g(x - x^{-1})$;

(3.6) for $A_2 > 0$, for $\alpha > 0$ and all sufficiently large x, $1 - G(x) \le A_2 f(x)^{\alpha}$;

(3.7) $x^{(2-\rho)/\rho} \{1 - G(x)\} \to 0$ as $x \to \infty$.

Assumption (3.1) is satisfied by compactly supported kernels commonly used
in practice, and, in particular, by the Epanechnikov, biweight and triweight
kernels; condition (3.2) is satisfied by the optimal bandwidths discussed in
Section 2; (3.3) asks that the tails of f and g be smooth and eventually
decreasing; (3.4) asks that the tails of f not decrease too rapidly, and is
satisfied by the majority of distributions that have infinite tails to the right;
(3.5) asks that f eventually dominate g; (3.6) asserts that this domination
is sufficiently great; and (3.7) holds if the lighter-tailed distribution G has
finite moment of order $(2-\rho)/\rho$.

Theorem 3.1. If (3.1)-(3.7) hold, then for some $x_0 > 0$,

$$P\{\text{for each } x > x_0, \text{ one of the following two properties holds: (a) } p \hat f(x) > (1-p) \hat g(x), \text{ or (b) } \hat f(x) = \hat g(x) = 0,\ \hat g(y) = 0 \text{ for all } y > x,\ \hat f(\hat x-) > 0 \text{ and } \hat g(\hat x-) = 0\} = 1 - o\{(nh)^{-1}\} \tag{3.8}$$

as $n \to \infty$.

Next we investigate an instance where f and g both have Pareto-type
tails, but the tail weights are sufficiently similar for the algorithm to
have difficulty distinguishing between them. Specifically, assume that

$$f(x) \sim a x^{-\alpha} \quad \text{and} \quad g(x) \sim b x^{-\beta} \quad \text{as } x \to \infty, \tag{3.9}$$

where $a, b > 0$ and $1 < \alpha < \beta < \alpha + 1 < \infty$.

Let $A_2 = A_1 \cup A_R$ denote the algorithm constructed by using $A_1$ to classify
x if not both of $\hat f(x)$ and $\hat g(x)$ vanish, and using $A_R$ otherwise.
Theorem 3.2. If (3.1), (3.2) and (3.9) hold, then for all sufficiently
large $x_0$,

$$nh \int_{x_0}^{\infty} P(x \text{ is classified by } A_2 \text{ as coming from } g)\, f(x)\,dx \to \infty \tag{3.10}$$

as $n \to \infty$.

3.3. Implications of Theorems 3.1 and 3.2. An immediate consequence
of Theorem 3.1 is that if (3.1)-(3.7) hold, then the probability that, uniformly
in new data x on $[x_0, \infty)$, $A_2$ is equivalent to classifying in the optimal
way using $A_0$, equals $1 - o\{(nh)^{-1}\}$. Therefore, taking the classification
interval to be $\mathcal{I} = [x_0, \infty)$, we deduce that

$$\mathrm{err}_{A_2}(f,g|\mathcal{I}) - \mathrm{err}_{A_0}(f,g|\mathcal{I}) = o\{(nh)^{-1}\} \tag{3.11}$$

as $n \to \infty$. The left-hand side of (3.11) is of course nonnegative; it represents
the Bayes risk for an empirical classification rule, minus the risk for the
optimal rule.

There is, of course, a version of $A_R$ for the left-hand tail; call it $A_L$. Let A
denote the algorithm that classifies x using $A_1$ if $\hat f(x)$ and $\hat g(x)$ do not both
vanish, or using $A_R$ if $\hat f(x) = \hat g(x) = 0$ and x lies to the right of the median
of $\mathcal{X} \cup \mathcal{Y}$, or using $A_L$ otherwise. (Our choice of the median is arbitrary.)
Assume f and g are continuous on the real line, that the supports of f and g
are intervals, that neither density vanishes at any point in the interior of its
support, that the conditions of Theorem 2.1 hold on any compact interval
$\mathcal{I}$ that is interior to the intersection of the supports, that the conditions of
Theorem 3.1 (possibly with f and g interchanged) hold to the right, and
that the analogous conditions hold to the left. Then in either tail, either f
or g dominates the other, and so there can be only a finite number of points
($\nu$, say) at which the graphs of $pf$ and $(1-p)g$ cross.

In these circumstances we may deduce from Theorems 2.1 and 3.1 that
the expansions of classification error described in Theorem 2.1 hold for the
algorithm A applied to classification on the whole real line $\mathbb{R}$:

$$\mathrm{err}_{A}(f,g|\mathbb{R}) - \mathrm{err}_{A_0}(f,g|\mathbb{R}) = \frac{1}{2} \sum_{j=1}^{\nu} |\Delta'(y_j)|^{-1} E\{p \hat f(y_j) - (1-p) \hat g(y_j)\}^2 + o\{(nh)^{-1} + h^4\}. \tag{3.12}$$

The remainder term here can be sharpened to $o\{(nh)^{-1} + h^8\}$ if the conditions
of the second part of Theorem 2.1 apply, in particular, if $h_1$ and $h_2$
satisfy (2.9).

In view of these results, the discussion of optimality given following
Theorem 2.1 applies to the present general, global setting, where A is used to
classify any real-valued datum x. The asymptotically optimal bandwidths
are either $h_j^0 = H_j n^{-1/5}$ or $h_j^0 = H_j n^{-1/9}$, where $(H_1, H_2)$ minimizes either
(2.14) or (2.15), respectively, and $H_2 = R^{1/2} H_1$ in the latter case.

It may be proved from (3.9) that if $x_0$ is sufficiently large, then (3.11)
fails. Therefore, if the bandwidths $h_1$ and $h_2$ are chosen so as to minimize
the inherent additional classification error in the body of the distribution,
relative to the optimal algorithm $A_0$, this performance will not be reflected
when using $A_2$ to classify data in the tails. If (3.9) holds, then the additional
error introduced by the difficulty of classifying data in the tails is so large
as to dominate the relatively low levels of error (in comparison with $A_0$)
experienced elsewhere.

The rate of divergence in (3.10) can be arbitrarily slow, in the sense that
for any given $\epsilon > 0$ there exist densities f and g satisfying (3.9) and for
which the left-hand side of (3.10) diverges to infinity more slowly than $n^{\epsilon}$,
as $n \to \infty$.

Work of Chanda and Ruymgaart (1989) provides some further detail related
to Theorem 3.2. Addressing the case where f and g differ only in
location, and the density tails decrease like $x^{-\gamma}$ as $x \to \infty$, Chanda and
Ruymgaart show that the difference between the error of the empirical classifier
and its asymptotic limit is of size $(nh)^{-\gamma/(\gamma+2)}$. Moreover, if the density
tails decrease like $e^{-x^{\gamma}}$, then the rate $O(n^{-4/5})$ is possible if $\gamma > 1$, although
a slower rate occurs if $\gamma < 1$.

4. Empirical choice of bandwidth. 

4.1. Discussion of methods. We could compute bandwidths by construct- 
ing empirical approximations to the functions appearing in (2.14) and (2.15), 
finding the minima of empirical forms of those expressions and substitut- 
ing the resulting values into formulae for theoretically optimal bandwidths. 
However, this technique is awkward to use, since it requires explicitly work- 
ing out how many times the graphs of pf and (1 — p)g cross and where 
the crossings take place. This calls for technology similar to bump hunting 
methods. The relative complexity of that approach motivates alternative, 
more implicit techniques for bandwidth selection. One possibility is cross- 
validation, which at first sight seems very attractive. 

A cross-validation method for choosing bandwidth is as follows. Let $\hat f_{-i}$
and $\hat g_{-i}$ denote the respective versions of $\hat f$ and $\hat g$, defined at (2.2), that
are obtained through computing the latter estimators from the leave-one-out
datasets $\mathcal{X}_i = \mathcal{X} \setminus \{X_i\}$ and $\mathcal{Y}_i = \mathcal{Y} \setminus \{Y_i\}$, respectively. (We continue to
use respective bandwidths $h_1$ and $h_2$.) Put $\hat\Delta_{f,-i} = p \hat f_{-i} - (1-p) \hat g$,
$\hat\Delta_{g,-i} = p \hat f - (1-p) \hat g_{-i}$ and

$$\widehat{\mathrm{err}}_{A_1}(h_1, h_2) = \frac{p}{m} \sum_{i=1}^{m} I\{\hat\Delta_{f,-i}(X_i) < 0,\ X_i \in \mathcal{I}\} + \frac{1-p}{n} \sum_{i=1}^{n} I\{\hat\Delta_{g,-i}(Y_i) > 0,\ Y_i \in \mathcal{I}\}. \tag{4.1}$$

One might choose $(h_1, h_2) = (\hat h_1, \hat h_2)$ to minimize $\widehat{\mathrm{err}}_{A_1}(h_1, h_2)$. The latter
may be viewed as an empirical approximation to $\mathrm{err}_{A_1}(f,g|\mathcal{I})$. However, this
approach performs poorly in both theory and practice, and, in particular,
does not accurately estimate, in the sense of relative consistency, the value
of $(h_1, h_2)$ that minimizes $\mathrm{err}_{A_1}(f,g|\mathcal{I})$. See Section 7 for details.
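
For concreteness, the leave-one-out criterion can be sketched as below. The paper argues that this criterion performs poorly, so the code only illustrates how it would be computed. The weights p/m and (1 - p)/n are taken to mirror the Bayes risk definition, and the Epanechnikov kernel, the function names and the toy data are assumptions.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel, supported on [-1, 1]."""
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)

def kernel_sums(points, data, h):
    """For each point t, return sum_j K((t - data_j) / h)."""
    return epanechnikov((points[:, None] - data[None, :]) / h).sum(axis=1)

def loo_error(xs, ys, p, h1, h2, lo, hi):
    """Leave-one-out estimate of classification error on the interval [lo, hi]."""
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    m, n = xs.size, ys.size
    k0 = 0.75  # K(0): the contribution of a point to the estimate at itself
    f_loo_at_x = np.maximum(kernel_sums(xs, xs, h1) - k0, 0.0) / ((m - 1) * h1)
    g_at_x = kernel_sums(xs, ys, h2) / (n * h2)
    f_at_y = kernel_sums(ys, xs, h1) / (m * h1)
    g_loo_at_y = np.maximum(kernel_sums(ys, ys, h2) - k0, 0.0) / ((n - 1) * h2)
    x_in = (xs >= lo) & (xs <= hi)
    y_in = (ys >= lo) & (ys <= hi)
    # A point from F is misclassified when the estimated difference is negative,
    # and a point from G when it is positive.
    miss_x = (p * f_loo_at_x - (1.0 - p) * g_at_x < 0) & x_in
    miss_y = (p * f_at_y - (1.0 - p) * g_loo_at_y > 0) & y_in
    return float(p * miss_x.mean() + (1.0 - p) * miss_y.mean())

# Hypothetical, well-separated samples: no leave-one-out errors are expected.
err_separated = loo_error(np.linspace(0, 1, 50), np.linspace(10, 11, 50), 0.5, 0.3, 0.3, -1.0, 12.0)
```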

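A second, more effective approach, which we shall consider in detail, is
based on using the bootstrap to estimate $\mathrm{err}_{A_1}(f,g|\mathcal{I})$ and, thereby, to select
the optimal bandwidths. Specifically, let $\tilde f$ and $\tilde g$ be the versions of $\hat f$ and $\hat g$,
defined at (2.2), that arise if we use respective bandwidths $h_3$ and $h_4$ (instead
of $h_1$ and $h_2$). Conditional on $\mathcal{X}$ (or on $\mathcal{Y}$), draw m data $\mathcal{X}^* = \{X_1^*, \ldots, X_m^*\}$
independently and uniformly from the distribution with density $\tilde f$ (or, resp.,
n data $\mathcal{Y}^* = \{Y_1^*, \ldots, Y_n^*\}$ independently and uniformly from the distribution
with density $\tilde g$), and let

$$\hat f^*(x) = \frac{1}{m h_1} \sum_{i=1}^{m} K\Bigl(\frac{x - X_i^*}{h_1}\Bigr), \qquad \hat g^*(x) = \frac{1}{n h_2} \sum_{i=1}^{n} K\Bigl(\frac{x - Y_i^*}{h_2}\Bigr).$$

Put $\Delta^*(x) = p \hat f^*(x) - (1-p) \hat g^*(x)$ and

$$\widetilde{\mathrm{err}}_{A_1}(h_1, h_2) = p \int_{\mathcal{I}} P\{\Delta^*(x) < 0 \mid \mathcal{X} \cup \mathcal{Y}\}\, \tilde f(x)\,dx + (1-p) \int_{\mathcal{I}} P\{\Delta^*(x) > 0 \mid \mathcal{X} \cup \mathcal{Y}\}\, \tilde g(x)\,dx.$$

Choose $(h_1, h_2) = (\hat h_1, \hat h_2)$ to minimize $\widetilde{\mathrm{err}}_{A_1}(h_1, h_2)$.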
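
A Monte Carlo approximation to the bootstrap criterion might look like the sketch below. Two simplifications are assumptions of this sketch rather than features of the paper: a Gaussian kernel is used throughout, which makes sampling from the pilot estimates trivial but is not compactly supported, and the conditional probabilities are approximated by a modest number of bootstrap replications on a grid. All names and numerical settings are hypothetical.

```python
import numpy as np

def gauss_kde(grid, data, h):
    """Gaussian-kernel density estimate evaluated at each grid point."""
    u = (grid[:, None] - data[None, :]) / h
    return np.exp(-0.5 * u ** 2).sum(axis=1) / (data.size * h * np.sqrt(2.0 * np.pi))

def bootstrap_error(xs, ys, p, h1, h2, h3, h4, lo, hi, n_boot=50, n_grid=201, seed=0):
    """Monte Carlo version of the bootstrap error criterion on [lo, hi]."""
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    rng = np.random.default_rng(seed)
    grid = np.linspace(lo, hi, n_grid)
    dx = grid[1] - grid[0]
    f_pilot = gauss_kde(grid, xs, h3)   # pilot estimate with bandwidth h3
    g_pilot = gauss_kde(grid, ys, h4)   # pilot estimate with bandwidth h4
    neg = np.zeros(n_grid)
    pos = np.zeros(n_grid)
    for _ in range(n_boot):
        # Sampling from a kernel estimate: resample the data, then add kernel noise.
        x_star = rng.choice(xs, size=xs.size) + h3 * rng.standard_normal(xs.size)
        y_star = rng.choice(ys, size=ys.size) + h4 * rng.standard_normal(ys.size)
        delta_star = p * gauss_kde(grid, x_star, h1) - (1.0 - p) * gauss_kde(grid, y_star, h2)
        neg += delta_star < 0
        pos += delta_star > 0
    neg /= n_boot
    pos /= n_boot
    # Riemann sum for the two integrals in the criterion.
    return float(dx * np.sum(p * neg * f_pilot + (1.0 - p) * pos * g_pilot))

rng = np.random.default_rng(42)
xs = rng.normal(0.0, 1.0, size=100)    # hypothetical sample from F
ys = rng.normal(10.0, 1.0, size=100)   # hypothetical sample from G
err_hat = bootstrap_error(xs, ys, p=0.5, h1=0.5, h2=0.5, h3=1.0, h4=1.0, lo=-4.0, hi=14.0)
```

In practice the function would be evaluated over a set of candidate pairs of bandwidths and the minimizing pair retained.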

In the two respective cases we need to choose $h_3$ and $h_4$ so that the "pilot"
density estimators $\tilde f$ and $\tilde g$ are able to consistently estimate second, or
fourth, derivatives of f and g. It is known from more conventional applications
of curve estimation that this requires $h_3$ and $h_4$ to be of strictly larger
order than $n^{-1/5}$ or $n^{-1/9}$, respectively. Therefore, we should choose $h_3$ and
$h_4$ to both be of size $n^{-\alpha}$, where in the first regime $0 < \alpha < 1/5$ and in the
second $0 < \alpha < 1/9$. Since taking $0 < \alpha < 1/9$ covers both cases, then, for
simplicity, we shall make that assumption in our theoretical results below. For
the same reason we shall assume four derivatives of f and g in the neighborhood
of each cross-over point, although in the case of the first regime only
two derivatives are required.
4.2. Main results. We shall assume the following:

(4.2) f and g are continuously differentiable and are bounded away from zero
in an open interval containing $\mathcal{I}$; $\Delta$ vanishes at just $\nu$ points, $y_1, \ldots, y_\nu$,
in $\mathcal{I}$, all of them interior points and at each of which $\Delta'(y_j) \ne 0$; and f
and g each have four continuous derivatives in neighborhoods of each $y_j$;

(4.3) either $\nu \ge 1$, $f''(y_j) g''(y_j) \ne 0$ for at least one j, and the $\nu$ equations
$p f''(y_j) - R (1-p) g''(y_j) = 0$ do not have a simultaneous solution
$R > 0$; or $\nu = 1$ and $f''(y_1) g''(y_1) > 0$, in which case a solution exists;

(4.4) $m/n$ is bounded away from zero and infinity as $n \to \infty$;

(4.5) K is a compactly supported function with four Hölder continuous derivatives
on the real line and satisfying $\int K = 1$;

(4.6) for j = 3 and 4, $h_j = h_j(n) \asymp n^{-\alpha}$ as $n \to \infty$, where $0 < \alpha < 1/15$.



Condition (4.3) implies that one or other of the two main regimes of behavior
of $h_1^0$ and $h_2^0$ obtains. If $\rho = 1/5$ or $1/9$ in the two respective cases, then
the optimal bandwidths are $h_j^0 \sim H_j n^{-\rho}$ for j = 1, 2, where $H_1$ and $H_2$ are
positive constants.

Given $0 < c_1 < 1/9 < 1/5 < c_2 < 1$, let $(h_1, h_2) = (\hat h_1, \hat h_2)$ denote the bandwidth
pair that minimizes $\widetilde{\mathrm{err}}_{A_1}(h_1, h_2)$ over $(h_1, h_2)$ such that $n^{-c_2} \le h_1, h_2 \le
n^{-c_1}$. The theorem below shows that each empirical bandwidth $\hat h_j$ is asymptotic
to its asymptotically optimal counterpart $h_j^0$. In addition, if a sufficiently
high-order kernel is used to estimate f and g, then an empirical form
of (2.9) holds.

Theorem 4.1. Assume $0 < p < 1$ and $\mathcal{I}$ is a compact interval, and
that (4.2)-(4.6) hold. Then, for j = 1 and 2, $\hat h_j / h_j^0 \to 1$ in probability as
$n \to \infty$. Furthermore, if K is of order r, meaning that $\int u^j K(u)\,du = 0$ for
$1 \le j \le r - 1$, if $r > 2/(5\alpha)$, if the second part of (4.3) obtains, and if f and
g have r + 2 bounded derivatives in a neighborhood of $y_1$, then the following
empirical form of (2.9) holds:

$$\frac{\hat h_2}{\hat h_1} = \Bigl\{\frac{p f''(y_1)}{(1-p) g''(y_1)}\Bigr\}^{1/2} + o_p(n^{-2/9}).$$

5. Multiple or multivariate populations. 

5.1. Multiple univariate populations. Suppose there are N distributions,
$F_1, \ldots, F_N$ say, with respective densities $f_1, \ldots, f_N$ and prior probabilities
$p_1, \ldots, p_N$, where $\sum_j p_j = 1$. Let A denote a general algorithm for classifying
data in a given interval $\mathcal{I}$. The "ideal" algorithm which minimizes the Bayes
risk

$$\mathrm{err}_A(f_1, \ldots, f_N|\mathcal{I}) = \sum_{j=1}^{N} p_j \int_{\mathcal{I}} P(x \text{ is not classified by } A \text{ as coming from } f_j)\, f_j(x)\,dx,$$

is the classification rule $A_0$ which declares x to have come from $f_j$ if $p_j f_j(x) =
\max_k \{p_k f_k(x)\}$. (Ties may be broken at random.) Here it is assumed that
the prior probabilities for $f_1, \ldots, f_N$, restricted to $\mathcal{I}$, are $p_1, \ldots, p_N$, respectively.

Assume that for each $1 \le j \le N$, we have access to a sample $X_{j1}, \ldots, X_{jn_j}$ of independent and identically distributed data drawn from distribution $F_j$. Assume the samples are themselves independent. Construct the density estimator

$$\hat f_j(x) = \frac{1}{n_j h_j} \sum_{i=1}^{n_j} K\Bigl(\frac{x - X_{ji}}{h_j}\Bigr),$$

where $h_j$ is a bandwidth. Let $\mathcal{A}_1$ denote the empirical algorithm which declares $x$ to have come from $f_j$ if and only if $p_j \hat f_j(x) = \max_k \{p_k \hat f_k(x)\}$. (Breaking ties at random in this rule has no effect on our asymptotic results, provided $\max_j p_j f_j$ is bounded away from zero on $\mathcal{I}$.)
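The empirical rule just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `triweight`, `kde` and `classify` are hypothetical, the triweight kernel is borrowed from the simulation study in Section 6, and ties are resolved by taking the smallest index rather than at random.

```python
def triweight(u):
    """Triweight kernel K(u) = (35/32)(1 - u^2)^3 for |u| <= 1, zero elsewhere."""
    return (35.0 / 32.0) * (1.0 - u * u) ** 3 if abs(u) <= 1.0 else 0.0


def kde(x, sample, h, kernel=triweight):
    """Kernel density estimate (n_j h_j)^{-1} * sum_i K((x - X_ji) / h_j)."""
    return sum(kernel((x - xi) / h) for xi in sample) / (len(sample) * h)


def classify(x, samples, priors, bandwidths):
    """Index j maximising p_j * fhat_j(x).

    Assumption: ties go to the smallest index (the paper breaks them at random).
    """
    scores = [p * kde(x, s, h) for s, p, h in zip(samples, priors, bandwidths)]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy usage with two well-separated training samples (made-up numbers).
toy_samples = [[-1.0, -0.8, -1.2, -0.9], [1.0, 0.9, 1.1, 1.3]]
label_left = classify(-0.9, toy_samples, [0.5, 0.5], [0.5, 0.5])
label_right = classify(1.05, toy_samples, [0.5, 0.5], [0.5, 0.5])
```

Each population gets its own bandwidth, which is what allows the bandwidth ratios discussed in Section 5.2 to be tuned separately.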

Let $\mathcal{I}$ denote a compact interval, and assume $\max_j p_j f_j$ is bounded away from zero in an open interval containing $\mathcal{I}$; that $\Delta_{ij} = p_i f_i - p_j f_j$ vanishes only at discrete interior points $y_{ijk}$ of $\mathcal{I}$, where $1 \le k \le \nu_{ij}$ and $\Delta'_{ij}(y_{ijk}) \neq 0$; that these points are distinct, in the sense that $y_{i_1 j_1 k_1} = y_{i_2 j_2 k_2}$ implies $\{i_1, j_1\} = \{i_2, j_2\}$ and $k_1 = k_2$; that $n_1 \to \infty$ and each ratio $n_j/n_k$ is bounded; that for each $1 \le j \le N$, $h_j = h_j(n_1) \asymp n_1^{-1/5}$ as $n_1 \to \infty$; and that other conditions, for example, on the smoothness of each $f_j$, are analogous to those in Section 2. Put $n = n_1$, $H_j = n^{1/5} h_j$ and $r_j = n_j/n$, let $\kappa$ and $\kappa_2$ be as in Section 2, and define

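$$T(H_1, \ldots, H_N) = \frac{1}{2} \mathop{\sum\sum}_{i<j} \sum_{k=1}^{\nu_{ij}} |\Delta'_{ij}(y_{ijk})|^{-1} \Bigl[ \kappa \bigl\{ (r_i H_i)^{-1} p_i^2 f_i(y_{ijk}) + (r_j H_j)^{-1} p_j^2 f_j(y_{ijk}) \bigr\} + \tfrac{1}{4} \kappa_2^2 \bigl\{ p_i H_i^2 f_i''(y_{ijk}) - p_j H_j^2 f_j''(y_{ijk}) \bigr\}^2 \Bigr]. \qquad (5.1)$$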
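Then the following analogues of (2.8) and (2.10) may be derived:

$$\begin{aligned} \mathrm{err}_{\mathcal{A}_1}(f_1, \ldots, f_N \mid \mathcal{I}) - \mathrm{err}_{\mathcal{A}_0}(f_1, \ldots, f_N \mid \mathcal{I}) &= \frac{1}{2} \mathop{\sum\sum}_{i<j} \sum_{k=1}^{\nu_{ij}} |\Delta'_{ij}(y_{ijk})|^{-1} E\{p_i \hat f_i(y_{ijk}) - p_j \hat f_j(y_{ijk})\}^2 + o(n^{-4/5}) \\ &= T(H_1, \ldots, H_N)\, n^{-4/5} + o(n^{-4/5}). \end{aligned} \qquad (5.2)$$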

5.2. Implications of (5.2). Our assumptions imply that no three graphs of the functions $p_i f_i$ cross at a single point $y \in \mathcal{I}$, and, indeed, (5.2) fails in such cases. Although those cases might be considered rare, Fukunaga and Flick (1984) show that they can arise.

The context directly addressed by (5.2) is that where it is impossible to choose $H_1, \ldots, H_N > 0$ such that $(H_i/H_j)^2 = p_j f_j''(y_{ijk}) / \{p_i f_i''(y_{ijk})\}$ for each triple of indices $(i, j, k)$ such that $p_i f_i$ and $p_j f_j$ cross at some point $y_{ijk} \in \mathcal{I}$. For example, this can be because $f_i''(y_{ijk})\, f_j''(y_{ijk}) < 0$ for some $(i, j, k)$, or because for some pair $(i, j)$ the ratio $f_i''(y_{ijk}) / f_j''(y_{ijk})$ varies with $k$. Here the optimal rate of convergence to zero of the difference in Bayes risk is $n^{-4/5}$, and its minimal size is obtained by choosing $h_j = H_j n^{-1/5}$, where $H_1, \ldots, H_N$ minimizes $T(H_1, \ldots, H_N)$ at (5.1).

Consider next the case where there is only one nonzero value of $\nu_{ij}$, and it equals 1. Here the optimal algorithm $\mathcal{A}_0$ reduces to distinguishing between just two densities, $f_i$ and $f_j$ say. The empirical algorithm $\mathcal{A}_1$ also effectively reduces to a two-population one, where the convergence rate can be either $n^{-4/5}$ or $n^{-8/9}$. Since this case has already been discussed in Section 2, there is no need to treat it further.

There are, however, nonpathological instances where $\sum_{i<j} \nu_{ij} > 1$ and the convergence rate $n^{-8/9}$, rather than $n^{-4/5}$, obtains. Consider, for example, the case where, for $1 \le j \le M$ (and $M < N$), the graph of $p_j f_j$ crosses that of $p_{j+1} f_{j+1}$ at a single point, $y_j$ say; and no other crossings of graphs occur within $\mathcal{I}$. If, at each crossing, the graphs are all locally concave or all locally convex, then, by choosing $H_1, \ldots, H_{M+1}$ such that $(H_j/H_{j+1})^2 = p_{j+1} f_{j+1}''(y_j) / \{p_j f_j''(y_j)\}$ for $1 \le j \le M$, we ensure that the bias contribution to $T(H_1, \ldots, H_N)$, that is, the second term in (5.1), vanishes identically. In this case the faster convergence rate of $n^{-8/9}$ can be obtained by choosing $h_j = H_j n^{-1/9}$ throughout. (Choice of $h_j$ for $j \ge M + 2$ is relatively unimportant, since the corresponding densities do not cross any other density in $\mathcal{I}$. Nevertheless, taking $h_j = n^{-1/9}$ is adequate.) There are many related examples of this type.
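The chain of bandwidth constants in the example above can be computed recursively from the curvatures at the crossing points. The sketch below is illustrative only: the function name `chain_constants`, its argument layout and the numbers in the usage lines are hypothetical, and in practice the priors and curvatures would have to be known or estimated.

```python
import math


def chain_constants(priors, curv_left, curv_right, first=1.0):
    """Constants H with (H_j / H_{j+1})^2 = p_{j+1} f_{j+1}''(y_j) / (p_j f_j''(y_j)).

    priors[j]     : prior p_j of population j (0-based, M + 1 entries)
    curv_left[j]  : f_j''(y_j), curvature of density j at crossing j
    curv_right[j] : f_{j+1}''(y_j), curvature of density j + 1 at crossing j
    first         : arbitrary positive starting value for the first constant
    """
    consts = [first]
    for j, (c_left, c_right) in enumerate(zip(curv_left, curv_right)):
        ratio_sq = priors[j + 1] * c_right / (priors[j] * c_left)
        if ratio_sq <= 0:
            # Opposite curvature signs: the bias term cannot be cancelled.
            raise ValueError("curvatures at crossing %d differ in sign" % j)
        consts.append(consts[-1] / math.sqrt(ratio_sq))
    return consts


# Made-up inputs: three populations, both crossings locally concave.
demo_priors = [0.3, 0.3, 0.4]
demo_left = [-0.2, -0.1]
demo_right = [-0.4, -0.3]
demo_H = chain_constants(demo_priors, demo_left, demo_right)
```

With these constants each difference $p_j H_j^2 f_j''(y_j) - p_{j+1} H_{j+1}^2 f_{j+1}''(y_j)$ is zero, which is the cancellation that the text attributes to the bias term.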

5.3. Multivariate populations. Let $f$ and $g$ be densities of $d$-variate distributions $F$ and $G$, respectively, where $d > 1$. We assume classification is conducted for new data $x$ coming from a region $\mathcal{R}$, which here plays the role of the interval $\mathcal{I}$ in Section 2. The empirical rule $\mathcal{A}_1$ classifies $x$ as coming from $F$ or $G$, according as $p \hat f(x) - (1 - p)\hat g(x)$ is positive or negative, where on the present occasion,

$$\hat f(x) = \frac{1}{m h_1^d} \sum_{i=1}^{m} K\Bigl(\frac{x - X_i}{h_1}\Bigr), \qquad \hat g(x) = \frac{1}{n h_2^d} \sum_{i=1}^{n} K\Bigl(\frac{x - Y_i}{h_2}\Bigr),$$

$\mathcal{X} = \{X_1, \ldots, X_m\}$ and $\mathcal{Y} = \{Y_1, \ldots, Y_n\}$ are training datasets drawn from $F$ and $G$, respectively, $h_1$ and $h_2$ are bandwidths, and $K$ is a bounded, spherically symmetric and compactly supported probability density.

The classification rule $\mathcal{A}_0$ that minimizes Bayes risk amounts to classifying $x$ as coming from $F$ or $G$ according as $\Delta(x) > 0$ or $< 0$, where $\Delta = pf - (1 - p)g$. Let $\mathcal{C}$ denote that part of the set $\{y : \Delta(y) = 0\}$ which lies in $\mathcal{R}$, and write $\theta(y)$ for the vector of first derivatives of $\Delta$ at $y$. In place of (2.4) and (2.5), we assume that $f$ and $g$ have two continuous derivatives, and are bounded away from zero, in an open set containing $\mathcal{R}$, and the function $\theta$ does not vanish on $\mathcal{C}$. Take each $h_j$ to be of size $n^{-1/(d+4)}$. Then it may be proved that

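$$\mathrm{err}_{\mathcal{A}_1}(f, g \mid \mathcal{R}) - \mathrm{err}_{\mathcal{A}_0}(f, g \mid \mathcal{R}) = \frac{1}{2} \int_{\mathcal{C}} \|\theta(y)\|^{-1} E\{p \hat f(y) - (1 - p)\hat g(y)\}^2\,dy + o(n^{-4/(d+4)}). \qquad (5.3)$$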

Holmstrom and Klemela (1992) report the results of numerical experi- 
ments on kernel-based classification in the multivariate case. They provide 
no theory, however. 

5.4. Implications of (5.3). Taking $h_j = H_j n^{-1/(d+4)}$, Taylor expansion of the right-hand side of (5.3) may be shown to give

$$\mathrm{err}_{\mathcal{A}_1}(f, g \mid \mathcal{R}) - \mathrm{err}_{\mathcal{A}_0}(f, g \mid \mathcal{R}) = B(H_1, H_2)\, n^{-4/(d+4)} + o(n^{-4/(d+4)}),$$

where the constant $B(H_1, H_2)$ vanishes for either finite or infinite $(H_1, H_2)$ only if $\nabla^2 f / \nabla^2 g$ is constant throughout $\mathcal{C}$, with $\nabla^2$ denoting the Laplacian. Therefore, in virtually all cases there exists an optimal pair $(H_1, H_2) = (H_1^0, H_2^0)$ which minimizes $B(H_1, H_2)$. Then the optimal bandwidths $h_j^0 = H_j^0 n^{-1/(d+4)}$ are of size $n^{-1/(d+4)}$, which is the same size that leads to minimization of mean squared error of $\hat f$ and $\hat g$ as pointwise estimators of $f$ and $g$.

6. Numerical properties. We summarize a simulation study addressing properties of the empirical bandwidth selector introduced in Section 4. Recall from Section 2 that there are two main classes of problems, respectively characterized by the property that the densities $f$ and $g$ intersect at a point where the curvatures have different signs or the same sign. Call these classes 1 and 2; they correspond to the optimal bandwidth being of size $n^{-1/5}$ or $n^{-1/9}$, respectively. We shall report results for two examples in each class. Throughout, the distribution with density $f$ was standard normal, $p = 1/2$ and $m = n$. In the tails of the distributions, in any cases of ambiguity we classified using the method suggested in Section 3.

Classification was done on the entire real line, rather than on a compact 
interval as suggested in our theory. In the first examples, in each of classes 1 
and 2 the densities cross one another at one point in the tails, in addition 
to a crossing in the "middle" of the distribution. However, the tail crossing 
point is so far out that, for the sample sizes we used, it has negligible impact 
on numerical results, and so the effective value of $\nu$ is 1. The actual value of
$\nu$ is 1 for the second example in class 1. For the second example in class 2,
$\nu = 2$. However, there is strong symmetry in this case, with the result that
theoretical properties are essentially the same as they would be if $\nu$ were 1.
Nevertheless, the existence of two crossing points creates potential hazards 
for our empirical bandwidth selector, which is why we treated this example. 

In the first example in class 1, $g$ is the $N(-1.2, 0.6^2)$ density, the crossover occurs at $y_1 = -0.515$, and the curvatures there are $f''(y_1) = -0.255$ and $g''(y_1) = 0.281$. In the second example in class 1, $g$ is the density for the normal mixture

iN(0,l) + iN(l,(|) 2 ) + |N(i|,(f) 2 ), 

$y_1 = 0.707$, $f''(y_1) = -0.156$ and $g''(y_1) = 0.327$. In the first example in class 2, $g$ is the normal $N(1, 1)$ density, $y_1 = 0.5$ and $f''(y_1) = g''(y_1) = -0.264$. In the second example in class 2, $g$ is the Cauchy density $g(x) = \{\pi(1 + x^2)\}^{-1}$, there are two crossover points $y_1 = \pm 1.851$, and $f''(y_1) = 0.175$ and $g''(y_1) = 0.068$. Figure 1 illustrates the densities.
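The crossover points and curvatures quoted for the two class 2 examples can be checked numerically. The sketch below assumes equal priors ($p = 1/2$, as stated at the start of this section), so that the crossing of $pf$ and $(1-p)g$ is the crossing of $f$ and $g$; the helper names and the bisection bracket $[1, 3]$ are choices made here for illustration.

```python
import math


def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def normal_pdf(x, mu, sigma):
    return phi((x - mu) / sigma) / sigma


def normal_curvature(x, mu, sigma):
    """Second derivative of the N(mu, sigma^2) density."""
    z = (x - mu) / sigma
    return (z * z - 1.0) * phi(z) / sigma ** 3


def cauchy_pdf(x):
    return 1.0 / (math.pi * (1.0 + x * x))


def cauchy_curvature(x):
    """Second derivative of the standard Cauchy density."""
    return (6.0 * x * x - 2.0) / (math.pi * (1.0 + x * x) ** 3)


def bisect(fn, lo, hi, tol=1e-12):
    """Root of fn on [lo, hi]; fn(lo) and fn(hi) must differ in sign."""
    f_lo = fn(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = fn(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# First class 2 example: N(0,1) against N(1,1) cross at 0.5 by symmetry.
curv_normal_pair = normal_curvature(0.5, 0.0, 1.0)

# Second class 2 example: N(0,1) against the standard Cauchy density.
y_cauchy = bisect(lambda x: phi(x) - cauchy_pdf(x), 1.0, 3.0)
curv_f_at_y = normal_curvature(y_cauchy, 0.0, 1.0)
curv_g_at_y = cauchy_curvature(y_cauchy)
```

In both examples the two curvatures share a sign at the crossing, which is what places them in class 2.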

To implement the bootstrap method suggested in Section 4, we used the triweight kernel, $K(x) = (35/32)(1 - x^2)^3$ for $|x| \le 1$, and noted that the asymptotically optimal bandwidth for estimating $f^{(r)}$, in terms of minimizing mean integrated squared error, is

$$\Bigl\{ \frac{(2r + 1)\, R(K^{(r)})}{n\, \kappa_2(K)^2\, R(f^{(r+2)})} \Bigr\}^{1/(2r+5)},$$

where $R(L) = \int L^2$ and $\kappa_2(L) = \int u^2 L(u)\,du$. When constructing the estimators of $f$ and $g$ mentioned in Section 4, we took $r = 4$ and chose $h_3$, $h_4$ using the above formula, but (employing a device that might be implemented in practice) replaced $f$ by the normal density with zero mean and variance estimated from the training data. In the case of the Cauchy density, however, estimating scale in this way is inappropriate, and so instead the normalized interquartile range was used:

$$\frac{\text{sample interquartile range}}{\Phi^{-1}(3/4) - \Phi^{-1}(1/4)},$$






P. HALL AND K.-H. KANG 





Fig. 1. Densities used in simulation study. In each case the density
$f(x) = (2\pi)^{-1/2} \exp(-\tfrac{1}{2} x^2)$ is indicated by the dot-dashed line, and the density $g$
by the unbroken line. The densities depicted in the two panels in the first and second rows
correspond to those in the two examples in classes 1 and 2, respectively.



where $\Phi^{-1}$ denotes the standard normal quantile function.
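The normalized interquartile range is straightforward to compute with the Python standard library. This is a minimal sketch: the function name `normalized_iqr` is hypothetical, and the sample quartiles follow the default convention of `statistics.quantiles`, which need not match the one used in the original study.

```python
import math
from statistics import NormalDist, quantiles


def normalized_iqr(sample):
    """Sample interquartile range divided by the standard normal one."""
    q1, _, q3 = quantiles(sample, n=4)  # the three sample quartiles
    std_normal = NormalDist()
    return (q3 - q1) / (std_normal.inv_cdf(0.75) - std_normal.inv_cdf(0.25))


# Deterministic illustration: an equally spaced probability grid pushed
# through the quantile functions of N(0, 2^2) and of the standard Cauchy.
grid = [(i + 0.5) / 1000 for i in range(1000)]
normal_scale = normalized_iqr([NormalDist(0.0, 2.0).inv_cdf(u) for u in grid])
cauchy_scale = normalized_iqr([math.tan(math.pi * (u - 0.5)) for u in grid])
```

For normal data the statistic approximates the standard deviation (about 2 in the first illustration). For Cauchy data, where the variance does not exist, it remains finite, which is the reason the text gives for preferring it.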

The probability $P\{\Delta^*(x) < 0 \mid \mathcal{X} \cup \mathcal{Y}\}$ needed to estimate $\mathrm{err}_{\mathcal{A}_1}(h_1, h_2)$ was approximated using 100 bootstrap iterations. Minimization of $\widehat{\mathrm{err}}_{\mathcal{A}_1}(h_1, h_2)$ over $(h_1, h_2)$ was conducted on a fine grid of bandwidths. We simulated 100 samples for each of 10 logarithmically equally spaced sample sizes from 20 to 200.

Let $(\hat h_1, \hat h_2)$ denote the empirical bandwidths obtained in this way. For each of the four distributions, and for $j = 1, 2$, we plotted $-\log \hat h_j$ against $\log n$. The results are given in Figures 2 and 3, which correspond to class 1 and class 2, respectively. In each figure, the two rows of panels give plots that correspond to the first and second density pairs, respectively, in that class; and the first and second columns of panels show (as black dots) the average values (over the 100 independent samples) of the points $(-\log \hat h_1, \log n)$ in the case of the left-hand panel, or $(-\log \hat h_2, \log n)$ for the right-hand panel. In each of the four panels in each figure, the unbroken line is the conventional

Fig. 2. Plots for two examples in class 1. The two rows of panels show, respectively, 
simulation results for the two pairs of densities in class 1, that is, for the density pairs 
shown respectively in the first and second panels (in the first row) of Figure 1. In the jth 
column of each row the black dots show average values of $(-\log \hat h_j, \log n)$, computed as
described in Section 6. The unbroken line is the conventional least-squares regression line
through these points, and the dotted and dashed lines are drawn so that they have respective
slopes $1/9$ and $1/5$, and pass through the center of the least-squares regression line.



least-squares regression line through these points. The dotted and dashed
lines have slopes $1/9$ and $1/5$, respectively, with intercepts chosen so that each
of these lines passes through the center of the least-squares regression line.

The main point to note from the figures is that in the case of density pairs 
from class 1, the slope of the least-squares regression line is very close to $1/5$
(see Figure 2), while for class 2 it is close to $1/9$ (see Figure 3). This, of course,
reflects the theoretical results presented in Sections 3 and 4, where we showed 
that these particular slopes determine the optimal orders of bandwidth in the 
respective classes. The agreement between theory and numerical simulation 
is somewhat better in the case of class 1, but note that in the second class 
the numerical results clearly reflect the theory even in the Cauchy case. 
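The slope comparison described above amounts to an ordinary least-squares fit of $-\log \hat h_j$ on $\log n$, together with reference lines of slope $1/9$ and $1/5$. The sketch below is illustrative only: the bandwidths are synthetic (exactly proportional to $n^{-1/5}$, not simulation output), the rounding of the sample sizes is a choice made here, and "center" is interpreted as the centroid of the plotted points.

```python
import math


def ls_line(xs, ys):
    """Ordinary least-squares slope and intercept of ys regressed on xs."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, ybar - slope * xbar


def reference_intercept(slope, xs, ys):
    """Intercept of the line with the given slope through the centroid."""
    return sum(ys) / len(ys) - slope * sum(xs) / len(xs)


# Ten logarithmically equally spaced sample sizes from 20 to 200.
sample_sizes = [round(20 * 10 ** (k / 9)) for k in range(10)]

# Synthetic class 1 behaviour: bandwidths exactly proportional to n^(-1/5).
log_n = [math.log(n) for n in sample_sizes]
neg_log_h = [-math.log(0.8 * n ** (-1.0 / 5.0)) for n in sample_sizes]

fitted_slope, fitted_intercept = ls_line(log_n, neg_log_h)
dotted_intercept = reference_intercept(1.0 / 9.0, log_n, neg_log_h)
dashed_intercept = reference_intercept(1.0 / 5.0, log_n, neg_log_h)
```

On real simulation output the fitted slope would only approximate one of the two reference slopes; comparing it with both lines is the diagnostic used in Figures 2 and 3.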




Fig. 3. Plots for two examples in class 2. Details are as for Figure 2, except that the two 
rows of panels show results for the two pairs of densities in class 2. These density pairs 
are depicted in the first and second panels, respectively, in the last row of Figure 1. 



7. Reasons for failure of the cross-validation criterion, at (4.1), to provide effective minimization of Bayes risk. Failure occurs because the optimal bandwidths, discussed in Section 2.3, are determined by properties of mean squared error at isolated points, that is, the points where the graphs of $pf$ and $(1 - p)g$
cross. See (2.8). Cross-validation does not accurately estimate mean squared 
error at a point, unless one averages over neighboring points in a sufficiently 
wide interval. See, for example, the modifications of cross-validation that 
are necessary when it is used for local, as distinct from global, bandwidth 
choice [Hall and Schucany (1989) and Mielniczuk, Sarda and Vieu (1989)]. 
The same sort of averaging is required here, too, and so the use of sub- 
sidiary smoothing parameters is necessary to overcome the failure of cross- 
validation. That substantially reduces the attractiveness of the method. 

To appreciate these difficulties from a theoretical viewpoint, note that in order for the criterion defined at (4.1) to perform its function, it must equal $\mathrm{err}_{\mathcal{A}_1}(h_1, h_2)$, plus terms which either do not depend on $(h_1, h_2)$ or which depend on that quantity but are of smaller order than $\eta = (m h_1)^{-1} + (n h_2)^{-1} + h_1^4 + h_2^4$. (We shall say that such terms are of "type T.") It is not difficult to see that this must be true of both series on the right-hand side of (4.1); there cannot, in general, be judicious cancellation between the two quantities. In particular,

$$S(h_1, h_2) = \frac{1}{m} \sum_{i=1}^{m} I\{\hat\Delta_{f,-i}(X_i) < 0,\ X_i \in \mathcal{I}\}$$

must equal $s(h_1, h_2) = \int_{\mathcal{I}} P\{p \hat f < (1 - p)\hat g\}\, f$, plus terms of type T; call this property $P_1$. We shall outline a theoretical argument showing that, in general, $P_1$ fails to hold.

For simplicity, let us take $p = 1/2$, and $h_1$ and $h_2$ both to lie within the interval $\mathcal{H} = [n^{-1/5} C_1, n^{-1/5} C_2]$, where $0 < C_1 < 1 < C_2 < \infty$. We assume, too, that $m/n$ has a finite, nonzero limit, and that $f$ and $g$ cross at a unique point $y$ in $\mathcal{I}$, at which $\Delta'(y) \neq 0$ and the curvatures of $f$ and $g$ have different signs. The argument we shall employ to prove that $P_1$ fails in this case can be used to show that it fails more generally.

Put

$$S_0 = m^{-1} \sum_i I\{\Delta(X_i) < 0,\ X_i \in \mathcal{I}\} \quad\text{and}\quad U(h_1, h_2) = S(h_1, h_2) - S_0.$$

It is straightforward to show that $E\{S(h_1, h_2)\} = s(h_1, h_2) + O(\eta)$, and, of course, $S_0$ does not depend on $h_1$ and $h_2$. We shall prove that $\mathrm{var}\{U(h_1, h_2)\}$ is asymptotic to $n^{-1}$ multiplied by a bounded function which depends nondegenerately on $(v, w) = (n^{1/5} h_1, n^{1/5} h_2)$. Call this property $P_2$, and note that $\eta^2 = o(n^{-1})$ uniformly in $h_1, h_2 \in \mathcal{H}$. It may also be proved that $U(h_1, h_2)$ is asymptotically normally distributed, and converges weakly to a Gaussian process indexed by $(v, w) \in [C_1, C_2]^2$. These results imply that $P_1$ fails.
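The orders of magnitude behind this conclusion can be written out explicitly. The following is only a worked restatement of the rates already given, for bandwidths of size $n^{-1/5}$ and $m$ of the same order as $n$:

$$\eta = (m h_1)^{-1} + (n h_2)^{-1} + h_1^4 + h_2^4 \asymp n^{-4/5}, \qquad \eta^2 \asymp n^{-8/5} = o(n^{-1}), \qquad \{\mathrm{var}\, U(h_1, h_2)\}^{1/2} \asymp n^{-1/2}.$$

Since $n^{-1/2}$ is of strictly larger order than $n^{-4/5}$, the bandwidth-dependent random fluctuations of $U(h_1, h_2)$ dominate $\eta$, and so they cannot be terms of type T.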

Note that $\mathrm{var}\{U(h_1, h_2)\}$, being the variance of a sum, can be expanded as a sum of diagonal terms, plus a double series in off-diagonal terms. It is relatively straightforward to show that the sum of diagonal terms equals $O(\eta)$. Therefore, it suffices to show that $P_2$ applies to the double series in off-diagonal terms contributing to the variance. That quantity equals $(1 - m^{-1}) Q$, where

$$Q = \mathrm{cov}\bigl[ I\{\hat\Delta_{f,-1}(X_1) < 0,\ X_1 \in \mathcal{I}\} - I\{\Delta(X_1) < 0,\ X_1 \in \mathcal{I}\},\ I\{\hat\Delta_{f,-2}(X_2) < 0,\ X_2 \in \mathcal{I}\} - I\{\Delta(X_2) < 0,\ X_2 \in \mathcal{I}\} \bigr],$$

and so it is adequate to prove that $P_2$ applies to $Q$.

Define $\xi = \{(n - 1) h_1\}^{-1}$, $\hat f(x) = \xi \sum_{i \ge 3} K\{(x - X_i)/h_1\}$, $\delta_1(x_1, x_2) = \xi K\{(x_1 - x_2)/h_1\}$, $\delta_2(u) = \xi K(u)$, $p_j = P\{\hat f(x_j) - \hat g(x_j) + \delta_1(x_1, x_2) < 0\}$, $q_j = P\{\hat f_{-j}(x_j) - \hat g(x_j) < 0\}$ and $r_j = I\{\Delta(x_j) < 0\}$. Let $\mathcal{K}(x_1)$ denote the set of $u$ such that $x_1 - hu \in \mathcal{I}$, and put $h = h_1$. In this notation,

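$$\begin{aligned} Q &= \int_{\mathcal{I}} \int_{\mathcal{I}} \{(p_1 - r_1)(p_2 - r_2) - (q_1 - r_1)(q_2 - r_2)\}\, f(x_1) f(x_2)\,dx_1\,dx_2 \\ &= h \int_{\mathcal{I}} \int_{\mathcal{K}(x_1)} \{(a_1 - a_2)(b_1 - b_2) - (a_1 - a_3)(b_1 - b_3)\}\, f(x_1) f(x_1 - hu)\,dx_1\,du, \end{aligned}$$

where

$$\begin{aligned} a_1 &= P\{\hat f(x_1) - \hat g(x_1) < 0\} - r_1, \\ a_2 &= P\{-\delta_2(u) < \hat f(x_1) - \hat g(x_1) < 0\}, \\ a_3 &= P\{-\delta_1(x_1, X_2) < \hat f(x_1) - \hat g(x_1) < 0\}, \\ b_1 &= P\{\hat f(x_1 - hu) - \hat g(x_1 - hu) < 0\} - I\{\Delta(x_1 - hu) < 0\}, \\ b_2 &= P\{-\delta_2(u) < \hat f(x_1 - hu) - \hat g(x_1 - hu) < 0\}, \\ b_3 &= P\{-\delta_1(x_1 - hu, X_2) < \hat f(x_1 - hu) - \hat g(x_1 - hu) < 0\}. \end{aligned}$$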

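It may thus be shown that

$$\begin{aligned} Q &\sim h \int_{\mathcal{I}} \int_{\mathcal{K}(x_1)} \{(a_3 - a_2) b_1 + (b_3 - b_2) a_1\}\, f(x_1) f(x_1 - hu)\,dx_1\,du \\ &\sim 2h \int_{\mathcal{I}} \int_{\mathcal{K}(x_1)} (b_3 - b_2)\, a_1\, f(x_1)^2\,dx_1\,du \\ &\sim -2h \int_{\mathcal{I}} \int_{\mathcal{K}(x_1)} a_1 b_2\, f(x_1)^2\,dx_1\,du. \end{aligned}$$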

In the last-written integral, change variable from $x_1$ to $z$, where $x_1 = y + h^2 z$. Then, for arbitrarily small $\varepsilon > 0$, $Q$ is asymptotic to

$$-2 h^3 f(y)^2 \int_{|u| \le n^{\varepsilon}} \int_{|z| \le n^{\varepsilon}} P\{-\delta_2(u) < \hat f(y + h^2 z - hu) - \hat g(y + h^2 z - hu) < 0\} \times \bigl[ I\{\Delta(y + h^2 z) > 0\} - P\{\hat f(y + h^2 z) - \hat g(y + h^2 z) > 0\} \bigr]\,du\,dz. \qquad (7.1)$$

The probability that occurs as a factor in the integral at (7.1) is asymptotic to $(mh)^{-1/2}$ multiplied by a nondegenerate function of $(v, w) = (n^{1/5} h_1, n^{1/5} h_2)$. The factor within square brackets in (7.1) is asymptotic to another such function. Hence, $Q$ is asymptotic to $h^3/(mh)^{1/2} \asymp n^{-1}$, multiplied by a function of $(v, w)$, as had to be proved.